library(readr)
library(textstem)
library(stringr)
library(tidyr)
library(dplyr)
library(stopwords)
library(tidytext)
library(tidyverse)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(syuzhet)
library(ggplot2)
library(stringi)
library(tibble)
library(textdata)
library(Hmisc)
library(sentimentr)
library(zoo)
library(flextable)
library(magrittr)
library(lexicon)
library(broom)
library(qdap)
library(metricsgraphics)
library(ggthemes)
library(radarchart)
library(treemap)
library(ggraph)
library(lattice)
library(igraph)
library(udpipe)
library(widyr)
library(topicmodels)
library(rJava)
library(mallet)
library(LDAvis)
library(servr)
model_loaded <- udpipe::udpipe_download_model(language = "english-ewt")

1 Data Description

Our data was collected by us and is a collection of jokes and anecdotes from the site Reddit[1].

‘Reddit’ is an American social news aggregation, content rating, and discussion website. Registered users submit content to the site such as links, text posts, images, and videos, which are then voted up or down by other members. Reddit is organized into subreddits. A subreddit is a specific online community, together with the posts associated with it, on the social media website Reddit. Subreddits are dedicated to a particular topic that people write about, and they are denoted by /r/ followed by the subreddit’s name, e.g., /r/gaming.

In our work we consider the subreddit “jokes”, taken from its TOP page. The texts range from 5 to 350 words each. One parsing problem was that the title was downloaded into a different column than the main body of the text, while in most cases the title was the beginning of the joke. For instance, title: What do you get when you mix Napolean Dynamite and Napolean Bonaparte? and the main text: Napolean Blownapart. So we had to join these two columns into one, and we named the resulting column text. It will be the first column of importance to us. We also have the column “comments”, which will actively participate in our work as well. We deleted in advance the Reddit URLs and other material that we were not using in our analysis.
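As a hedged sketch of that join step (the column names `title` and `body` below are illustrative; in the real file the body column is called `text (without title)`), merging the two fields amounts to pasting them together:

```r
# Toy example of joining the title and body columns into one text column.
joke <- data.frame(
  title = "What do you get when you mix Napolean Dynamite and Napolean Bonaparte?",
  body  = "Napolean Blownapart",
  stringsAsFactors = FALSE
)
joke$text <- paste(joke$title, joke$body)  # base-R equivalent of tidyr::unite()
joke$text
```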

data <- read_csv("jokes_file.csv")
head(data)
## # A tibble: 6 × 5
##   date_utc   title                                         text  text …¹ comme…²
##   <date>     <chr>                                         <chr> <chr>     <dbl>
## 1 NA          <NA>                                          <NA>  <NA>        NA
## 2 2022-11-19 "What four elements are you not allowed to b… "Wha… "Nitro…      17
## 3 2022-11-19 "Why do farts smell?"                         "Why… "So de…       7
## 4 2022-11-06 "Two great opportunities are walking past a … "Two… "\nThe…       1
## 5 2022-12-05 "What was the snowman doing in the vegetable… "Wha… "Picki…       2
## 6 2022-11-22 "Check out \"conjunctivitis.com\"."           "Che… "Its a…       1
## # … with abbreviated variable names ¹​`text (without title)`, ²​comments

The data consists of 1000 posts and 5 columns (the date of publication, the title, the full text of the joke/anecdote, the text without the title, and the number of comments under that joke).

From that we will necessarily use the publication date, the number of comments and the text itself. Thanks to these parameters we will analyze the jokes, the content of the jokes and their popularity.


2 Research Question

In our article, we want to analyze how jokes with many comments differ from jokes with few comments. We compare word frequency for these two groups, so we divide the data into two parts, where “discussive” marks texts with many comments under the post and “not discussive” marks texts with few comments. We assume that the more discussable jokes will contain more harsh and discussion-inducing words (e.g., jokes about global situations, popular people, and issues), while the less discussable posts will contain more neutral words that describe people’s everyday lives and build the joke around that. RQ: Do harsh words or social topics in jokes affect the number of comments, or discussiveness, under the post?

In this vein, we think that we can observe some differences in both groups, so here we assume:

H1: We assume that the more discussed jokes will have more sharply political and other controversial words in their corpus than the less discussed posts. This may be because people have radically different views on politics and other world issues, so they will want to express their opinions in such comments or debate with other users. [2]

H2: We assume that the most neutral or mundane words will be more common in non-discussive posts than in discussive posts. This may be because people are very similar in these areas and lifestyles, so they won’t debate in the comments or leave many comments because of it.

H3: Since more negative emotions will encourage people to write comments, there will be more words associated with negative emotions in the discussive group than in the non-discussive group.

H4: When dividing discussive and not discussive texts into topics, there will be more debatable topics and words related to politics, religion, etc. in the discussive group than in the not discussive one.

H5: In the network analysis, it will be noticeable how words in the discussive group cluster around rather controversial topics, such as politics and religion, while in the not-discussive group there will be more daily-life jokes about family, friends, celebrations, etc.

2.1 Developing a doc_id identifier

We create a doc_id identifier for each Reddit text. It will be useful for understanding from which text a specific word came. We also used the na.omit function here to delete all NA rows, so that differing column lengths do not cause confusion, and we use doc_id to identify each text by number.

data <- na.omit(data)
head(data)
## # A tibble: 6 × 5
##   date_utc   title                                         text  text …¹ comme…²
##   <date>     <chr>                                         <chr> <chr>     <dbl>
## 1 2022-11-19 "What four elements are you not allowed to b… "Wha… "Nitro…      17
## 2 2022-11-19 "Why do farts smell?"                         "Why… "So de…       7
## 3 2022-11-06 "Two great opportunities are walking past a … "Two… "\nThe…       1
## 4 2022-12-05 "What was the snowman doing in the vegetable… "Wha… "Picki…       2
## 5 2022-11-22 "Check out \"conjunctivitis.com\"."           "Che… "Its a…       1
## 6 2022-11-21 "A plane is sitting at the terminal and is s… "A p… "Depar…       2
## # … with abbreviated variable names ¹​`text (without title)`, ²​comments
# all data is now cleared of NAs
data <- data %>% mutate(doc_id = row_number())
# each doc_id identifies the number of each text
head(data)
## # A tibble: 6 × 6
##   date_utc   title                                  text  text …¹ comme…² doc_id
##   <date>     <chr>                                  <chr> <chr>     <dbl>  <int>
## 1 2022-11-19 "What four elements are you not allow… "Wha… "Nitro…      17      1
## 2 2022-11-19 "Why do farts smell?"                  "Why… "So de…       7      2
## 3 2022-11-06 "Two great opportunities are walking … "Two… "\nThe…       1      3
## 4 2022-12-05 "What was the snowman doing in the ve… "Wha… "Picki…       2      4
## 5 2022-11-22 "Check out \"conjunctivitis.com\"."    "Che… "Its a…       1      5
## 6 2022-11-21 "A plane is sitting at the terminal a… "A p… "Depar…       2      6
## # … with abbreviated variable names ¹​`text (without title)`, ²​comments

3 Data preparation

At this point we begin to prepare our data for analysis. First we remove NA values from the data, then we clean up the text, tokenize it, and create frequency lists to take a closer look at the most popular words in each segment.

  1. Delete the strange signs that appeared in the text. We think these signs were emojis or other symbols that after parsing turned into these sequences within the text.

  2. Divide our data into two groups. Specifically, we divide the jokes into those that were “discussive” and more popular, and the less discussive ones, which we name “not discussive”. As stated in our hypotheses, we suggest that these two groups will have different sets of most common words: some words will lead to discussion because people want to share their opinion about them, while others will be less interesting, so people will react to them less (more about our hypotheses can be found in the presentation of our final work). We name the texts with more than 100 comments discussive, because the mean is 72 comments per post and the histogram of comments is strongly right-skewed, which means the outliers (the discussive texts) lie in the segment of >= 100 comments per post. You can also notice in the histogram that only a few posts have more than 500 comments; these are single cases, and it would be incorrect to compare such tiny groups, as we originally intended.

  3. Filter out texts shorter than 15 characters, as they are not very meaningful for our analysis and add noise to the further research.

data <- data %>% mutate(text = stri_replace_all_regex(text,
                                  pattern=c("\n","\030", "\031", "\032", "\033", "\034", "\035", "\036", "\037", "\038", "\039", "\f", "&amp", "#x200B", "\"", "\"v\"", "\"z\"" , "\"w\"", "\"ph\"", "\"o\"", "\"e\""),
                                  replacement=c("",  "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""),
                                  vectorize=FALSE))
# deleting all the signs and strange groups of numbers that appear in further analysis

3.1 Working with comments

At this point we want to see how to statistically explain the number of comments and select those disscusive and non-dissscusive.

data <- data %>% mutate(sum = str_count(text))
# counting the number of characters in each text (str_count with no pattern counts characters) and adding the column
data <- data %>% filter(sum > 15)
# filtering, keeping the texts that are longer than 15 characters
head(data)
## # A tibble: 6 × 7
##   date_utc   title                            text  text …¹ comme…² doc_id   sum
##   <date>     <chr>                            <chr> <chr>     <dbl>  <int> <int>
## 1 2022-11-19 "What four elements are you not… What… "Nitro…      17      1   123
## 2 2022-11-19 "Why do farts smell?"            Why … "So de…       7      2    53
## 3 2022-11-06 "Two great opportunities are wa… Two … "\nThe…       1      3   104
## 4 2022-12-05 "What was the snowman doing in … What… "Picki…       2      4    67
## 5 2022-11-22 "Check out \"conjunctivitis.com… Chec… "Its a…       1      5    54
## 6 2022-11-21 "A plane is sitting at the term… A pl… "Depar…       2      6   418
## # … with abbreviated variable names ¹​`text (without title)`, ²​comments
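One caveat worth flagging: `str_count(text)` with no pattern counts characters, not words, so the `sum` column above is a character count. A small base-R sketch makes the distinction explicit:

```r
s <- "Why do farts smell?"
chars <- nchar(s)                          # what str_count(s) returns
words <- length(gregexpr("\\w+", s)[[1]])  # what str_count(s, "\\w+") would return
chars  # 19
words  # 4
```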
hist(data$comments,
     ylim=c(0, 1100), 
     col="lightblue1",
     border="dodgerblue3",
     labels=TRUE, 
     main="Distribution of comments per text",
     xlab="Quantity of comments")  
abline(v=mean(data$comments),  # mean number of comments in our data
       col="dodgerblue3",
       lty=2,
       lwd=2)

So, we see that a large number of posts have no comments at all. To divide the data into two groups, we decided that discussive posts are those with more than 100 comments, and non-discussive are those with 100 or fewer. This splits our data into the two groups we compare. Below we divide the data and tokenize the texts by words.

data$Sentiment <- with(data, ifelse(comments > 100, 'discussive', 'not discussive'))
# dividing the texts into two groups by the number of comments, as justified by the histogram above
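The thresholding rule can be checked on a toy vector of comment counts (values illustrative):

```r
comments <- c(0, 7, 150, 432)
groups <- ifelse(comments > 100, "discussive", "not discussive")
groups
```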


4 Main Part

In the Main Part we begin our analysis. As methods that fit our hypotheses we have chosen: network analysis, sentiment analysis, and lexical statistics.

4.1 Frequency list and list of top words

At this stage, we want to do a primary analysis of the word groups and see which words occur frequently in both groups, make a list of the top words, and visualize this with graphs. To begin with, we divided the data into 2 separate datasets, where one contains only the texts of the “Discussive” group and the other only “Not discussive”. After partitioning, we tokenized the data for both groups.

datadiss <- filter(data, Sentiment == "discussive")
datanotdiss <- filter(data, Sentiment == "not discussive")

data_words1 <- datadiss %>% dplyr::select(Sentiment, text) %>% tidytext::unnest_tokens(word, text)
head(data_words1) 
## # A tibble: 6 × 2
##   Sentiment  word   
##   <chr>      <chr>  
## 1 discussive a      
## 2 discussive redneck
## 3 discussive is     
## 4 discussive pulled 
## 5 discussive over   
## 6 discussive by
data_words2 <- datanotdiss %>% dplyr::select(Sentiment, text) %>% tidytext::unnest_tokens(word, text)
head(data_words2)
## # A tibble: 6 × 2
##   Sentiment      word    
##   <chr>          <chr>   
## 1 not discussive what    
## 2 not discussive four    
## 3 not discussive elements
## 4 not discussive are     
## 5 not discussive you     
## 6 not discussive not

At this point we removed the stop words from both groups and left only the words we needed. Further on in the process of analysis we noticed that among the useful and relevant data there were still verbs and numbers in the list of top words, so we decided to clean up the data and remove these words as well.

Here you can see how custom stop words are created and added to stop_words.

stopwords("en")
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"

custom_stop_words <- tribble(
  # Matching column names as in stop_words
  ~word,  ~lexicon,
  # Words that are frequent in our jokes corpus but uninformative for the analysis
  "says", "CUSTOM",
  "told", "CUSTOM",
  "guy", "CUSTOM",
  "day", "CUSTOM",
  "replies", "CUSTOM",
  "replied", "CUSTOM",
  "first", "CUSTOM",
  "walks", "CUSTOM",
  "hey", "CUSTOM",
  "kid", "CUSTOM",
  "said", "CUSTOM",
  "two", "CUSTOM",
  "tells", "CUSTOM",
  "takes", "CUSTOM",
  "x200b", "CUSTOM",
  "wife", "CUSTOM",
  "doctor", "CUSTOM",
  "woman", "CUSTOM",
  "car", "CUSTOM",
  "time", "CUSTOM",
  "priest", "CUSTOM",
  "f150", "CUSTOM"
)


# Bind the custom stop words to stop_words
stop_words2 <- stop_words %>% 
  bind_rows(custom_stop_words)

head(stop_words2)
## # A tibble: 6 × 2
##   word      lexicon
##   <chr>     <chr>  
## 1 a         SMART  
## 2 a's       SMART  
## 3 able      SMART  
## 4 about     SMART  
## 5 above     SMART  
## 6 according SMART
enstopwords <- data.frame(word=stopwords("en"), stringsAsFactors=FALSE)
data.nonstopd <- data_words1 %>%
    anti_join(stop_words2)

data.nonstopnd <- data_words2 %>%
    anti_join(stop_words2)
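For a plain character vector, the stop-word removal that `anti_join()` performs is just membership filtering (toy tokens below; unlike `setdiff()`, this keeps duplicates, matching `anti_join()`'s behavior):

```r
tokens   <- c("the", "father", "said", "hell", "father")
stoplist <- c("the", "said")
kept <- tokens[!tokens %in% stoplist]
kept   # "father" "hell" "father"
```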

data.nonstopd <- data.nonstopd %>% 
    filter(!str_detect(word, "[0-9]+"))

data.nonstopnd <- data.nonstopnd %>% 
    filter(!str_detect(word, "[0-9]+"))
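The digit filter drops any token containing a digit, which removes both bare numbers and parsing artifacts like x200b (a base-R sketch of the same `str_detect` condition):

```r
words <- c("father", "x200b", "42", "monk")
clean <- words[!grepl("[0-9]+", words)]
clean   # "father" "monk"
```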

4.2 Discussive group

4.2.1 Corpus statistics

Then we decided to create a frequency list, calculated the total size of the corpus and the vocabulary size, then selected the 30 most frequent words and made a plot of their distribution. First we did this for the discussive group, and then for the non-discussive group.

So we got that for the discussive group the total size of the corpus = 3916 and the vocabulary size = 2128.

freqlistdiss <- data.nonstopd %>% count(word, sort=TRUE)
freqlistdiss
## # A tibble: 2,128 × 2
##    word       n
##    <chr>  <int>
##  1 father    22
##  2 peter     21
##  3 monk      19
##  4 home      18
##  5 call      17
##  6 hell      17
##  7 rolls     17
##  8 door      16
##  9 people    16
## 10 dog       15
## # … with 2,118 more rows
sum(freqlistdiss$n)
## [1] 3916
nrow(freqlistdiss)
## [1] 2128
freq_top30d <- freqlistdiss %>% top_n(30)
head(freq_top30d)
## # A tibble: 6 × 2
##   word       n
##   <chr>  <int>
## 1 father    22
## 2 peter     21
## 3 monk      19
## 4 home      18
## 5 call      17
## 6 hell      17

We also created a plot of the frequency distribution of words in the discussive group. This graph shows that among the top 30 frequent words there are both ordinary words (such as guy, rolls, kia) and words that can cause a person to leave a comment, i.e. quite debatable ones (lover, chinese, jewish, chuck).

ggplot(freq_top30d, aes(reorder(word, -n), n)) +
  geom_point(stat = "identity") +
  geom_line(group = 1) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.7)) +
  xlab("Words") +
  ylab("Frequency") +
  ggtitle("Distribution of the most frequent words in discussive Sentiment")

You can see that the frequencies in the discussive word list are much lower. As will be discussed below, this is because, in general, the total size of the discussive corpus is much smaller. Here is an interesting find: the word father comes first in the list of discussive words, while mother comes first in the list of not-discussive words. Most likely, this suggests that for people in our society dad is something more discussed, and everyone has a different opinion about him in life. This may be because a large percentage of people in the world grow up either without a father or with a stepfather, so the picture of the father is different for everyone and sparks discussion. In the case of mom, it seems to be something more stable and understandable, so people do not discuss it as much. Here we also see the appearance of frequent words that really are subject to discussion, for example hell, jesus, monk. Another interesting observation is that two male names can be found in the list of common words for the discussive group: Peter and Brian. Perhaps these are characters from some films or TV shows, which is why they ended up in this top list.

A wordcloud helps to make our assumptions more illustrative.

library(wordcloud)
data.nonstopd %>%
    dplyr::count(word) %>%
    with(wordcloud(word, n, min.freq = 6,
          max.words=100, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2")))

data_freqdiss <- freqlistdiss %>% mutate(rank = row_number())
data_freqdiss
## # A tibble: 2,128 × 3
##    word       n  rank
##    <chr>  <int> <int>
##  1 father    22     1
##  2 peter     21     2
##  3 monk      19     3
##  4 home      18     4
##  5 call      17     5
##  6 hell      17     6
##  7 rolls     17     7
##  8 door      16     8
##  9 people    16     9
## 10 dog       15    10
## # … with 2,118 more rows
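With a rank column in place, a quick Zipf-style check is possible: on a log-log scale, word frequency typically falls roughly linearly with rank. A sketch with toy counts (the real data would plot `data_freqdiss$rank` against `data_freqdiss$n`):

```r
freq <- data.frame(n = c(22, 21, 19, 18, 17, 16, 15))  # toy frequencies
freq$rank <- seq_len(nrow(freq))
plot(log(freq$rank), log(freq$n), type = "b",
     xlab = "log rank", ylab = "log frequency",
     main = "Zipf check (toy counts)")
```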

4.3 Not Discussive group

4.3.1 Corpus Statistics

We did the same with the not discussive group. It turned out that for the not discussive group the total size of the corpus = 18892 and the vocabulary size = 7109. We understand that the data sizes of the discussive and not discussive groups differ markedly; we need to take this into account in further research.

freqlistnotdiss <- data.nonstopnd %>% count(word, sort=TRUE)
freqlistnotdiss
## # A tibble: 7,109 × 2
##    word        n
##    <chr>   <int>
##  1 mother     57
##  2 sex        50
##  3 father     48
##  4 husband    46
##  5 people     46
##  6 boy        45
##  7 call       45
##  8 door       45
##  9 home       44
## 10 dad        40
## # … with 7,099 more rows
sum(freqlistnotdiss$n)
## [1] 18892
nrow(freqlistnotdiss)
## [1] 7109
freq_top30nd <- freqlistnotdiss %>% top_n(30)
head(freq_top30nd)
## # A tibble: 6 × 2
##   word        n
##   <chr>   <int>
## 1 mother     57
## 2 sex        50
## 3 father     48
## 4 husband    46
## 5 people     46
## 6 boy        45

We also created a plot of the frequency of use of various words in the non-discussive group. This graph shows that among the top 30 frequent words there are both common words (such as day, wife, guy, car, father, mother, etc.) and more interesting ones (such as sex, money, time).

ggplot(freq_top30nd, aes(reorder(word, -n), n)) +
  geom_point(stat = "identity") +
  geom_line(group = 1) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.7)) +
  xlab("Words") +
  ylab("Frequency") +
  ggtitle("Distribution of the most frequent words in non-discussive Sentiment")

We can see that in this case the most frequent words are those related to daily activities: mother, father, sex, husband, call, etc. This supports our hypothesis that the posts we defined as not discussive contain words that are neutral and that everyone encounters in their routine. Most likely these are jokes about comic situations in everyday life, but we can only trace this through topics, so this graph gives us a partial indication that our hypothesis is plausible.

This is the wordcloud for the not discussive group. A wordcloud helps us visualize our data and see, in a simpler image, the frequent words in our list, as well as those that are frequent but not frequent enough to make it onto the graph. To do this, we identified different frequency groups that describe non-discussive posts and colored each with a specific color.

library(wordcloud)
data.nonstopnd %>%
    dplyr::count(word) %>%
    with(wordcloud(word, n, min.freq = 6,
          max.words=100, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2")))

data_freqnotdiss <- freqlistnotdiss %>% mutate(rank = row_number())
data_freqnotdiss
## # A tibble: 7,109 × 3
##    word        n  rank
##    <chr>   <int> <int>
##  1 mother     57     1
##  2 sex        50     2
##  3 father     48     3
##  4 husband    46     4
##  5 people     46     5
##  6 boy        45     6
##  7 call       45     7
##  8 door       45     8
##  9 home       44     9
## 10 dad        40    10
## # … with 7,099 more rows

Conclusion: Given a wordcloud and a plot that show the most popular words in each group, namely discussive and non-discussive, we partially reject our first hypothesis, “H1: We assume that the more discussed jokes will have more sharply political and other controversial words in their corpus than the less discussed posts,” because we did not observe any serious difference between the two groups. But our second hypothesis, “We assume that the most neutral or mundane words will be more common in non-discussive posts than discussive posts,” we partially confirm, because, as the plots and wordclouds show, neutral words are common in both groups.

5 Sentiment Analysis

At this point we want to test our 3rd hypothesis and find out the emotional composition of words in both groups. For this we decided to use sentiment analysis of the texts. We took our clean data (without stop words), combined it into one table, and joined a lexicon that determines what emotion a word carries.

dissandnondiss <- rbind(data.nonstopd,data.nonstopnd) %>%
  dplyr::group_by(Sentiment) %>%
  dplyr::mutate(words = n()) %>%
  dplyr::left_join(tidytext::get_sentiments("nrc")) %>%
  dplyr::mutate(word = factor(word),
         sentiment = factor(sentiment))
head(dissandnondiss)
## # A tibble: 6 × 4
## # Groups:   Sentiment [1]
##   Sentiment  word      words sentiment
##   <chr>      <fct>     <int> <fct>    
## 1 discussive redneck    3916 <NA>     
## 2 discussive pulled     3916 <NA>     
## 3 discussive policeman  3916 fear     
## 4 discussive policeman  3916 positive 
## 5 discussive policeman  3916 trust    
## 6 discussive policeman  3916 fear

Next, we added a column with the percentage that each word’s sentiment occurrences make up of all the words in its group.

percentall <- dissandnondiss %>%
  dplyr::group_by(Sentiment) %>%
  dplyr::group_by(word, sentiment,Sentiment) %>%
  dplyr::summarise(sentiment = unique(sentiment),
                   sentiment_freq = n(),
                   words = unique(words)) %>%
  dplyr::filter(is.na(sentiment) == F) %>%
  dplyr::mutate(percentage = round(sentiment_freq/words*100, 1))

head(percentall)
## # A tibble: 6 × 6
## # Groups:   word, sentiment [5]
##   word         sentiment Sentiment      sentiment_freq words percentage
##   <fct>        <fct>     <chr>                   <int> <int>      <dbl>
## 1 absence      fear      discussive                  1  3916          0
## 2 absence      negative  discussive                  1  3916          0
## 3 absence      sadness   discussive                  1  3916          0
## 4 absolute     positive  discussive                  1  3916          0
## 5 absolute     positive  not discussive              1 18892          0
## 6 accidentally surprise  discussive                  1  3916          0
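The `percentage` column can be recomputed by hand, which also explains why low-frequency words round down to 0 (counts below are illustrative, taken from the frequency tables earlier):

```r
# e.g. a word seen 17 times among 3916 discussive tokens:
round(17 / 3916 * 100, 1)   # 0.4
# a single occurrence among 3916 tokens rounds to 0:
round(1 / 3916 * 100, 1)    # 0
```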

And here we have the graph of emotions, which shows as a percentage how often each emotion occurs in each group. Here we can already see in the first graph that the discussive group (orange) exceeds the non-discussive group (black) in negative emotions (anger, disgust, sadness, and fear), but, on the other hand, for positive emotions such as trust, joy, and surprise the results are quite ambiguous, so we decided to look further at where negative and positive emotions prevail.

In the second graph we see the same thing: the discussive group predominates more strongly in negative and aggressive emotions than the non-discussive group, but, on the other hand, the discussive group also predominates in positive emotions like trust and anticipation.

percentall %>%
  dplyr::filter(sentiment != "positive",
         sentiment != "negative") %>%
  ggplot(aes(sentiment, percentage, fill = Sentiment)) +    
  geom_bar(stat="identity",   
           position=position_dodge()) + 
  scale_fill_manual(name = "", values=c("orange", "grey30")) +
  theme_bw() +
  theme(legend.position = "top")

percentall %>%
  dplyr::filter(sentiment != "positive",
         sentiment != "negative") %>%
  dplyr::mutate(sentiment = factor(sentiment, 
                            levels = c("anger", "fear", "disgust", "sadness",
                                   "surprise", "anticipation", "trust", "joy"))) %>%
  ggplot(aes(Sentiment, percentage, fill = sentiment)) +    
  geom_bar(stat = "identity", position=position_dodge()) + 
  scale_fill_brewer(palette = "RdBu") +
  theme_bw() +
  theme(legend.position = "right") +
  coord_flip()

Next, we decided to look at which words in each group in each emotion category contributed to the emotion score. In other words, we will explore which words are most important for emotion scores in each group. For ease of interpretation, we will remove several major emotion categories as well as polarity.

posts_impw <- dissandnondiss %>%
  dplyr::filter(!is.na(sentiment),
         sentiment != "anticipation",
         sentiment != "surprise",
         sentiment != "disgust",
         sentiment != "negative",
         sentiment != "sadness",
         sentiment != "positive") %>%
  dplyr::mutate(sentiment = factor(sentiment, levels = c("anger", "fear",  "trust", "joy"))) %>%
  dplyr::group_by(Sentiment) %>%
  dplyr::count(word, sentiment, sort = TRUE) %>%
  dplyr::group_by(Sentiment, sentiment) %>%
  dplyr::top_n(3) %>%
  dplyr::mutate(score = n/sum(n))
posts_impw %>%
  dplyr::group_by(Sentiment) %>%
  slice_max(score, n = 20) %>%
  dplyr::arrange(desc(score)) %>%
  dplyr::ungroup() %>%
  ggplot(aes(x = reorder(word, score), y = score, fill = word)) +
  facet_wrap(Sentiment~sentiment, ncol = 4, scales = "free_y") +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(x = "Words")

Thus we see that in the discussive group, in the categories anger and fear, words like hell, money, lion, and bad contributed most, with hell repeated across categories. In the categories trust and joy, more influence came from the words father, mother, love, and monk. The picture differs in the non-discussive group, where in the category anger the most important words are bear, money, and also bad; in the category fear the greatest contribution was made by bear, god, bad, and so on.

In general, it is noticeable that in both groups words like god, mother, doctor, love, bad appear, and we decided to analyze the two groups further.

5.1 Polarity exploration

Next, we wanted to take a closer look at the polarity of each group, for which we need to prepare the data and show the polarity of the words. To begin, we will do an overview of the discussive group, then of the non-discussive group.

5.1.1 Discussive group

To begin with, we counted the polarity in the discussive group, that is, the overall mood of the group. As you can see, we have also calculated other statistical indicators, namely the standard deviation and the standardized mean polarity, which will be useful for visualizing the plot below.

datadiss %$% polarity(text)
##   all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all             153       11410        0.001       0.303              0.002
dataDiss2 <- datadiss %$% polarity(text, Sentiment)

We needed all of this in order to build the polarity plot for the discussive group.

plot(dataDiss2)

Then, for clarity, we decided to build a comparison cloud that shows the most popular words in each sentiment category. It clearly shows which words are most often used for each sentiment. For example, in the disgust category the most popular word is hell.

nrc_lex <- get_sentiments("nrc")
nrc_lex %>% count(sentiment)
## # A tibble: 10 × 2
##    sentiment        n
##    <chr>        <int>
##  1 anger         1245
##  2 anticipation   837
##  3 disgust       1056
##  4 fear          1474
##  5 joy            687
##  6 negative      3316
##  7 positive      2308
##  8 sadness       1187
##  9 surprise       532
## 10 trust         1230
diss_tidy <- data.nonstopd %>%
  # Inner join to nrc lexicon
  inner_join(nrc_lex, by = c("word" = "word")) %>% 
  # Drop positive or negative
  filter(!grepl("positive|negative", sentiment)) %>% 
  # Count by sentiment and term
  count(sentiment, word) %>% 
  # Spread sentiment, using n for values
  spread(sentiment, n, fill = 0)  %>% 
  # Convert to data.frame, making term the row names
  data.frame(row.names = "word")
comparison.cloud(diss_tidy, max.words = 50, title.size = 1.5)

5.1.2 non-Discussive group

To begin with, we again computed the polarity, this time for the non-discussive group. As before, besides the average polarity we calculated the standard deviation and the standardized mean polarity, which will be useful for the plot below.

datanotdiss %$% polarity(text)
##   all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all             847       56228       -0.023       0.256             -0.088
datanotDiss3 <- datanotdiss %$% polarity(text, Sentiment)

This is the polarity plot for the non-discussive group. We can see that it contains more sentences than the plot above, so the points are more crowded.

plot(datanotDiss3)

And for the non-discussive group we have a word cloud where we again clearly see different words in each category. For example, the most popular word associated with anger is bear.

nrc_lex <- get_sentiments("nrc")
nrc_lex %>% count(sentiment)
## # A tibble: 10 × 2
##    sentiment        n
##    <chr>        <int>
##  1 anger         1245
##  2 anticipation   837
##  3 disgust       1056
##  4 fear          1474
##  5 joy            687
##  6 negative      3316
##  7 positive      2308
##  8 sadness       1187
##  9 surprise       532
## 10 trust         1230
nondiss_tidy <- data.nonstopnd %>%
  # Inner join to nrc lexicon
  inner_join(nrc_lex, by = c("word" = "word")) %>% 
  # Drop positive or negative
  filter(!grepl("positive|negative", sentiment)) %>% 
  # Count by sentiment and term
  count(sentiment, word) %>% 
  # Spread sentiment, using n for values
  spread(sentiment, n, fill = 0)  %>% 
  # Convert to data.frame, making term the row names
  data.frame(row.names = "word")

# Examine
head(nondiss_tidy)
##              anger anticipation disgust fear joy sadness surprise trust
## abandon          0            0       0    1   0       1        0     0
## abbot            0            0       0    0   0       0        0     2
## abuse            1            0       1    1   0       1        0     0
## accident         0            0       0    4   0       4        4     0
## accidentally     0            0       0    0   0       0        8     0
## accomplish       0            0       0    0   1       0        0     0
comparison.cloud(nondiss_tidy, max.words = 50, title.size = 1.5)

Conclusion: considering all our analysis above, we can confirm our hypothesis about emotions, because we can clearly see that in the discussive group the polarity is higher and there are more words related to aggression and negative emotions. If we recall our hypothesis - "Since more negative emotions will encourage people to write comments, there will be more words associated with negative emotions in the discussion group than in the non-discussive group." - and look at the resulting plots, it turns out that the discussive group had more negatively associated words than the non-discussive group.

6 Network

6.1 Relationships between words N-Gram

6.1.1 Relationships between words N-Gram in discussive posts

First we create bigrams

bigrams <- datadiss %>% 
  unnest_tokens(word, text, token = "ngrams", n = 2) %>% 
  separate(word, c("word1", "word2"), sep = " ")
head(bigrams)
## # A tibble: 6 × 9
##   date_utc   title              text …¹ comme…² doc_id   sum Senti…³ word1 word2
##   <date>     <chr>              <chr>     <dbl>  <int> <int> <chr>   <chr> <chr>
## 1 2022-11-20 A redneck is pull… "Polic…     144    604    84 discus… a     redn…
## 2 2022-11-20 A redneck is pull… "Polic…     144    604    84 discus… redn… is   
## 3 2022-11-20 A redneck is pull… "Polic…     144    604    84 discus… is    pull…
## 4 2022-11-20 A redneck is pull… "Polic…     144    604    84 discus… pull… over 
## 5 2022-11-20 A redneck is pull… "Polic…     144    604    84 discus… over  by   
## 6 2022-11-20 A redneck is pull… "Polic…     144    604    84 discus… by    a    
## # … with abbreviated variable names ¹​`text (without title)`, ²​comments,
## #   ³​Sentiment

Next we compute tf-idf for the bigrams. The tf-idf statistic is intended to measure how important a word is to a document in a collection (or corpus) of documents.
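As a minimal sketch of how the statistic works (tidytext's bind_tf_idf uses idf = ln(number of documents / number of documents containing the term); the tiny corpus below is made up for illustration):

```r
library(dplyr)
library(tidytext)

# Toy corpus: "sat" occurs in both documents, "cat" and "dog" in one each
toy <- tibble::tibble(
  doc_id = c(1, 1, 2, 2),
  word   = c("cat", "sat", "dog", "sat")
)
toy_tf_idf <- toy %>%
  count(doc_id, word) %>%
  bind_tf_idf(word, doc_id, n)
toy_tf_idf
# "sat" appears in every document, so its idf = ln(2/2) = 0 and tf_idf = 0;
# "cat" and "dog" each get tf_idf = 0.5 * ln(2/1), roughly 0.347
```

Terms that occur in every document get a tf-idf of zero, which is why common bigrams like "of the" never rank highly.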

bigrams <- datadiss %>%
  unnest_tokens(bigram, text, token = "ngrams", n=2)

bigrams %>%
  count(bigram, sort = TRUE)
## # A tibble: 8,263 × 2
##    bigram       n
##    <chr>    <int>
##  1 of the      51
##  2 in the      45
##  3 to the      45
##  4 and the     27
##  5 do you      25
##  6 on the      24
##  7 and says    21
##  8 in a        21
##  9 at the      19
## 10 the man     17
## # … with 8,253 more rows
bigram_tf_idf <- bigrams %>%
  count(doc_id, bigram) %>%
  bind_tf_idf(bigram, doc_id, n) %>%
  arrange(desc(tf_idf))

head(bigram_tf_idf)
## # A tibble: 6 × 6
##   doc_id bigram                n    tf   idf tf_idf
##    <int> <chr>             <int> <dbl> <dbl>  <dbl>
## 1    892 3 words:lazy          1 0.25   5.03  1.26 
## 2    892 describe yourself     1 0.25   5.03  1.26 
## 3    892 in 3                  1 0.25   5.03  1.26 
## 4    892 yourself in           1 0.25   5.03  1.26 
## 5    805 with t                2 0.167  5.03  0.838
## 6    945 my coffee             2 0.167  4.34  0.723
bigrams_separated <- bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

6.2 Visualising a network of bigrams in discussive group

R is able to visualise the relationships among many words simultaneously: we can arrange the words into a network of connected nodes. The graphs have 3 variables:

From: the node an edge is coming from
To: the node an edge is going towards
Weight: a numeric value associated with each edge
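These three variables map directly onto igraph's graph_from_data_frame(): the first two columns of the data frame become the edges, and any extra columns (here the bigram count n) become edge attributes. A minimal sketch with made-up word pairs:

```r
library(igraph)

# A minimal edge list with the three variables described above;
# the word pairs are invented for illustration
edges <- data.frame(
  from = c("st", "hot"),     # the node an edge comes from
  to   = c("peter", "dog"),  # the node an edge goes towards
  n    = c(5, 3)             # weight: how often the bigram occurs
)
g <- graph_from_data_frame(edges)
ecount(g)  # 2 edges
E(g)$n     # the count column is carried over as an edge attribute
```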

On the graph we can see the most common word combinations in discussive texts. In general, each collocation makes sense: for example, st peter, pilot in airbus, rolls royce, hot dog. You can also see that the word guy is linked to both chenese and jewish, which means the corpus contains both the combination chenese guy and jewish guy.

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# new bigram counts:
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

bigram_graph <- bigram_counts %>%
  filter(n > 2) %>%
  graph_from_data_frame() 

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label=name), vjust = 1, hjust = 1)

This is a more detailed chart with directionality. Here you can see from which word to which word the bigrams go: the first word (from which the arrow comes) is the word the bigram begins with, and the second word (into which the arrow points) is where the bigram ends.

Example: front -> door; chenese -> guy <- jewish

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

6.3 Pairwise Correlation for discussive group

The pairwise_cor() function computes the pairwise correlation between words: how often two words appear together in the same document, relative to how often they appear separately. The pairwise comparison method (sometimes called the 'paired comparison method') is a process for ranking or choosing from a group of alternatives by comparing them against each other in pairs, i.e. two alternatives at a time. Most of all we liked the combination of politics and Satan.
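Under the hood, widyr's pairwise_cor() computes the phi coefficient between every pair of words based on which documents they appear in. A minimal sketch on a made-up three-document corpus:

```r
library(dplyr)
library(widyr)

# Toy corpus: "rain" and "storm" always co-occur; "sun" never joins them
toy <- tibble::tibble(
  doc  = c(1, 1, 2, 2, 3),
  word = c("rain", "storm", "rain", "storm", "sun")
)
toy_cors <- toy %>%
  pairwise_cor(word, doc, sort = TRUE)
toy_cors
# rain/storm get a correlation (phi) of 1: perfect co-occurrence;
# sun is perfectly anti-correlated with both (-1)
```

A correlation of exactly 1, like the sisters/sign/mercy cluster below, therefore just means the words occur in exactly the same set of documents.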

datadiss <- datadiss %>% tidytext::unnest_tokens(word, text)
wosw_tidy <- datadiss %>%
  anti_join(get_stopwords()) %>%   # remove stopwords
  filter(is.na(as.numeric(word)))  # drop purely numeric tokens

wosw_tidy %>%
  select(doc_id, word) %>%
  head(10)
## # A tibble: 10 × 2
##    doc_id word     
##     <int> <chr>    
##  1    604 redneck  
##  2    604 pulled   
##  3    604 policeman
##  4    604 policeman
##  5    604 got      
##  6    604 id       
##  7    604 redneck  
##  8    605 removed  
##  9    605 reddit   
## 10    605 removed
word_cors <- wosw_tidy %>%
  group_by(word) %>%
  filter(n() >= 5) %>%
  pairwise_cor(word, doc_id, sort = TRUE)

word_cors
## # A tibble: 63,756 × 3
##    item1       item2       correlation
##    <chr>       <chr>             <dbl>
##  1 sisters     sign                  1
##  2 mercy       sign                  1
##  3 sign        sisters               1
##  4 mercy       sisters               1
##  5 sign        mercy                 1
##  6 sisters     mercy                 1
##  7 millionaire crocodile             1
##  8 brian       crocodile             1
##  9 crocodile   millionaire           1
## 10 brian       millionaire           1
## # … with 63,746 more rows

The output above is useful for further exploration. Let's visualize the word pairs with the strongest correlations.

word_cors %>%
  filter(correlation > .75) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

6.3.1 Topic modelling in discussive group with LDA

Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words.

dtm <- DocumentTermMatrix(data.nonstopd$word)
dtm <- na.omit(dtm)
# create model fit: drop empty documents so LDA does not fail on all-zero rows
raw.sum <- apply(dtm, 1, FUN = sum)
dtm <- dtm[raw.sum != 0, ]
# taking 3 topics, with a fixed seed for reproducibility
ap_lda <- LDA(dtm, k = 3, control = list(seed = 11091987))
ap_lda
## A LDA_VEM topic model with 3 topics.

Now we’ve fitted a LDA model. We have to now explore and interpret the model.

ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
## # A tibble: 6,315 × 3
##    topic term           beta
##    <int> <chr>         <dbl>
##  1     1 redneck   0.000523 
##  2     2 redneck   0.000386 
##  3     3 redneck   0.000644 
##  4     1 pulled    0.00205  
##  5     2 pulled    0.000221 
##  6     3 pulled    0.000831 
##  7     1 policeman 0.0000171
##  8     2 policeman 0.000741 
##  9     3 policeman 0.000794 
## 10     1 removed   0.00153  
## # … with 6,305 more rows
ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
ap_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

This visualization lets us understand the 3 topics that were extracted from the corpus. The most common words in topic 1 include "hell", "father" and "home", which suggests it may represent religion jokes. Those most common in topic 2 include "father", "door" and "peter", suggesting that this topic also leans towards religious jokes. Those most common in topic 3 include "monk", "dog" and "pilot", suggesting that this topic represents family-life jokes. One important observation about the words in each topic is that some words, such as "monk" and "father", are common within two topics.

beta_spread <- ap_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(topic1 > .001 | topic2 > .001 | topic3 > 0.001) %>%
  mutate(log_ratio = log2(topic2 / topic1))

beta_spread
## # A tibble: 390 × 5
##    term         topic1   topic2    topic3 log_ratio
##    <chr>         <dbl>    <dbl>     <dbl>     <dbl>
##  1 absolutely 0.000497 0.000550 0.00128       0.146
##  2 ago        0.000323 0.00161  0.00117       2.32 
##  3 aiden      0.00106  0.000434 0.000836     -1.28 
##  4 air        0.000679 0.000595 0.00105      -0.191
##  5 airbus     0.00341  0.00238  0.00120      -0.518
##  6 alive      0.00110  0.000627 0.000597     -0.818
##  7 american   0.00108  0.000980 0.000269     -0.140
##  8 answer     0.00112  0.000880 0.000327     -0.351
##  9 answered   0.00124  0.00211  0.00209       0.773
## 10 answers    0.00168  0.000636 0.0000162    -1.40 
## # … with 380 more rows
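The log_ratio column measures how strongly a term is associated with topic 2 versus topic 1. A quick sanity check of the arithmetic with toy beta values (not taken from our model):

```r
# beta is the per-topic word probability; log2 of their ratio tells us
# how many times more likely a term is in topic 2 than in topic 1
log2(0.002 / 0.001)  # = 1: the term is twice as likely in topic 2
log2(0.001 / 0.002)  # = -1: the term is twice as likely in topic 1
```

So in the table above, "ago" (log_ratio 2.32) is about five times more likely under topic 2, while "answers" (-1.40) leans towards topic 1.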

6.4 Relationships between words N-Gram

6.4.1 Relationships between words N-Gram in NOT discussive posts

First we create bigrams

bigrams <- datanotdiss %>%  
  unnest_tokens(bigram, text, token = "ngrams", n=2)

bigrams %>%
  count(bigram, sort = TRUE)
## # A tibble: 33,194 × 2
##    bigram       n
##    <chr>    <int>
##  1 to the     209
##  2 in the     199
##  3 of the     185
##  4 and the    129
##  5 on the     121
##  6 at the     113
##  7 do you     106
##  8 the man    102
##  9 and says    86
## 10 for a       82
## # … with 33,184 more rows
bigram_tf_idf <- bigrams %>%
  count(doc_id, bigram) %>%
  bind_tf_idf(bigram, doc_id, n) %>%
  arrange(desc(tf_idf))

head(bigram_tf_idf)
## # A tibble: 6 × 6
##   doc_id bigram            n    tf   idf tf_idf
##    <int> <chr>         <int> <dbl> <dbl>  <dbl>
## 1    653 didn't jog       12 0.279  6.74   1.88
## 2    496 chuck norris      2 0.25   6.74   1.69
## 3    589 can't buy         2 0.222  6.74   1.50
## 4    162 irony wrinkly     1 0.2    6.74   1.35
## 5    162 of irony          1 0.2    6.74   1.35
## 6    162 opposite of       1 0.2    6.74   1.35

6.5 Visualising a network of bigrams in NOT discussive group

R is able to visualise the relationships among many words simultaneously: we can arrange the words into a network of connected nodes. The graphs have 3 variables:

From: the node an edge is coming from
To: the node an edge is going towards
Weight: a numeric value associated with each edge

On the graph we can see the most common word combinations in NOT discussive texts. In general, each collocation makes sense: for example, tic tacks and shop owners. You can also see that the word lemon connects to both half and cookie, which means the corpus contains both the combination half lemon and lemon cookie. The word stab appears as an isolated node because it is most often used next to itself (stab stab).

bigrams_separated <- bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# new bigram counts:
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

bigram_graph <- bigram_counts %>%
  filter(n > 4) %>%
  graph_from_data_frame() 

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label=name), vjust = 1, hjust = 1)

This is a more detailed chart with directionality. Here you can see from which word to which word the bigrams go: the first word (from which the arrow comes) is the word the bigram begins with, and the second word (into which the arrow points) is where the bigram ends.

Example: bad -> news; half -> lemon -> cookie

If we compare the two groups we have identified, this also supports hypothesis H5: the corpus of the discussive group most often contains bigrams with religious overtones (st peter, virgin mary, etc.), while the words in the NOT discussive group are more neutral.

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

6.6 Pairwise Correlation for NOT discussive group

The pairwise_cor() function computes the pairwise correlation between words: how often two words appear together in the same document, relative to how often they appear separately. The pairwise comparison method (sometimes called the 'paired comparison method') is a process for ranking or choosing from a group of alternatives by comparing them against each other in pairs, i.e. two alternatives at a time.

datanotdiss <- datanotdiss %>% tidytext::unnest_tokens(word, text)
wosw_tidy <- datanotdiss %>%
  anti_join(get_stopwords()) %>%   # remove stopwords
  filter(is.na(as.numeric(word)))  # drop purely numeric tokens

wosw_tidy %>%
  select(doc_id, word) %>%
  head(10)
## # A tibble: 10 × 2
##    doc_id word    
##     <int> <chr>   
##  1      1 four    
##  2      1 elements
##  3      1 allowed 
##  4      1 bring   
##  5      1 job     
##  6      1 nitrogen
##  7      1 sulfur  
##  8      1 fluorine
##  9      1 tungsten
## 10      1 nsfw
word_cors <- wosw_tidy %>%
  group_by(word) %>%
  filter(n() >= 5) %>%
  pairwise_cor(word, doc_id, sort = TRUE)

word_cors
## # A tibble: 1,438,800 × 3
##    item1      item2      correlation
##    <chr>      <chr>            <dbl>
##  1 birch      beech                1
##  2 woodpecker beech                1
##  3 beech      birch                1
##  4 woodpecker birch                1
##  5 beech      woodpecker           1
##  6 birch      woodpecker           1
##  7 hairy      sophie               1
##  8 sophie     hairy                1
##  9 thash      shee                 1
## 10 shee       thash                1
## # … with 1,438,790 more rows

The output above is useful for further exploration. Let's visualize the word pairs with the strongest correlations.

word_cors %>%
  filter(correlation > .95) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

dtm <- DocumentTermMatrix(data.nonstopnd$word)
dtm <- na.omit(dtm)
# create model fit: drop empty documents so LDA does not fail on all-zero rows
raw.sum <- apply(dtm, 1, FUN = sum)
dtm <- dtm[raw.sum != 0, ]
# taking 3 topics, with a fixed seed for reproducibility
ap_lda <- LDA(dtm, k = 3, control = list(seed = 11091987))
ap_lda
## A LDA_VEM topic model with 3 topics.

Now we’ve fitted a LDA model. We have to now explore and interpret the model.

ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
## # A tibble: 21,180 × 3
##    topic term           beta
##    <int> <chr>         <dbl>
##  1     1 elements 0.0000850 
##  2     2 elements 0.0000719 
##  3     3 elements 0.00000318
##  4     1 allowed  0.000563  
##  5     2 allowed  0.000522  
##  6     3 allowed  0.000196  
##  7     1 bring    0.000140  
##  8     2 bring    0.00223   
##  9     3 bring    0.000512  
## 10     1 job      0.00178   
## # … with 21,170 more rows
ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
ap_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

This visualization lets us understand the 3 topics that were extracted from the corpus. The most common words in topic 1 include "home", "house" and "dad", which suggests it may represent jokes about family. Those most common in topic 2 include "mother", "started" and "people", suggesting that this topic represents jokes about mothers. Those most common in topic 3 include "boy", "hand" and "farmer", suggesting that this topic represents jokes about the life of a boy.

Compared to the discussive group, you can see that the topics in the NOT discussive group are more neutral and mundane. There are no such distinctive topics as religion, for example. It is also interesting that no politics-related topic emerged from the topic model of the discussive group; however, words about politicians are noticeable in the bigram visualizations for that group.

In this case, we confirm our 4th hypothesis: when dividing discussive and not discussive texts into topics, there are indeed more discussable topics and words related to politics, religion, etc. in the discussive group than in the not discussive one, as noted above.

beta_spread <- ap_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(topic1 > .001 | topic2 > .001 | topic3 > 0.001) %>%
  mutate(log_ratio = log2(topic2 / topic1))

beta_spread
## # A tibble: 235 × 5
##    term          topic1    topic2    topic3 log_ratio
##    <chr>          <dbl>     <dbl>     <dbl>     <dbl>
##  1 afraid     0.00111   0.0000706 0.000742     -3.97 
##  2 american   0.000571  0.00103   0.00128       0.850
##  3 answers    0.00111   0.000167  0.00113      -2.73 
##  4 apple      0.00141   0.000296  0.00117      -2.26 
##  5 approached 0.0000784 0.000234  0.00113       1.58 
##  6 baby       0.00114   0.0000838 0.000377     -3.77 
##  7 bad        0.00276   0.00211   0.0000914    -0.391
##  8 ball       0.00259   0.00136   0.000859     -0.926
##  9 band       0.000105  0.00137   0.000282      3.71 
## 10 bang       0.00106   0.00150   0.000485      0.503
## # … with 225 more rows

7 Conclusion:

In our work we used various methods of text analysis, through which we were able to explore the texts in depth and confirm or disprove the hypotheses that we set out at the beginning of the work. Initially, we assumed that the discussive group would have more controversial topics compared to the texts from the NOT discussive group. We confirmed this hypothesis using topic modeling, taking 3 topics from each of the two groups and analyzing the words that occur in those topics. We also hypothesized that the discussive group would have more bigrams with religious or other context that not all people understand. We were able to confirm this hypothesis through network analysis.

Here again is a list of our hypotheses:

H1: We assume that the more discussed jokes will have more sharply political and other controversial words in their corpus than the less discussed posts. This may be because people have radically different views on politics and other world issues, so they will want to express their opinions in the comments or debate with other users. - Rejected

H2: We assume that the most neutral or mundane words will be more common in non-discussive posts than in discussive posts. This may be because people are very similar in these areas and lifestyles, so they won't debate in the comments and leave many comments because of it. - partially confirmed

H3: Since more negative emotions will encourage people to write comments, there will be more words associated with negative emotions in the discussive group than in the non-discussive group. - confirmed

H4: When dividing discussive and not discussive texts into topics, there will be more discussable topics and words related to politics, religion, etc. in the discussive group than in the not discussive one. - confirmed

H5: In the network analysis, it will be noticeable how words in the discussive group cluster around rather controversial topics, such as politics and religion, while in the non-discussive group there will be more daily-life jokes about family, friends, celebrations, etc. - confirmed

We also thought about improvements to our work, and we think it would be very important to carry out a similar analysis of Reddit posts with a larger sample; we remind you that this sample was made from posts published in the last 3 months. One could then study word polarity in more depth on the new sample: we personally liked how the polarity analysis performed on our data, and with further study on a new sample one could learn a lot more from the data.

Thank you all for your attention and Happy New Year


8 References

[1] Reddit: page "Jokes: Get Your Funny On!"
[2] Binns, Amy. 2012. 'Don't Feed the Trolls!' Journalism Practice 6(4): 547–62. doi: 10.1080/17512786.2011.648988.